Abstract


We  have  designed  a  portable  interface  between  shared-memory  multiprocessors  and 
Standard  ML  of  New  Jersey.  The  interface  is  based  on  the  conventional  kernel  thread 
model  and  provides  facilities  that  can  be  used  to  implement  user-level  thread  packages. 
The  interface  supports  experimentation  with  different  thread  scheduling  policies  and 
synchronization  constructs.  It  has  been  ported  to  three  different  multiprocessors  and  used 
to  construct  a  general  purpose,  user-level  thread  package.  In  this  paper,  we  discuss  the 
interface  and  its  implementation  and  performance,  with  emphasis  on  the  Silicon  Graphics 
4D/380S  multiprocessor. 
1 Introduction


Many applications, such as window systems, operating systems, transaction systems, and
interactive games, provide multiple services to one or more clients. To provide low-latency
service, such programs are often structured in a multi-threaded fashion. Many
researchers [11, 13, 14, 19, 25, 31, 39, 41, 48] are interested in adding concurrency primitives
to “higher-order” languages such as Standard ML [33] so that applications can be written
in a multi-threaded fashion and still take advantage of other high-level features, such as
strong typing, garbage collection, and modules.

Standard  ML  of  New  Jersey  [9]  (SML/NJ)  provides  an  elegant  set  of  extensions  to  SML 
that  makes  writing  an  efficient  thread  package  for  uniprocessors  quite  easy.  Uniprocessor 
implementations  simulate  concurrent  thread  execution,  distributing  access  to  the  processor 
among  the  currently  active  threads,  typically  with  the  aid  of  a  clock-driven  preemption 
mechanism.  At  least  four  such  packages  currently  exist  [14,  31,  37,  39]. 

Many multi-threaded applications offer better latency if they are executed in truly concurrent
fashion, using multiple processors. Moreover, a truly concurrent thread model can be
used  to  express  parallel  algorithms  designed  to  maximize  throughput.  We  have  designed 
and  implemented  a  shared-memory  multiprocessor  interface  for  SML/NJ,  called  MP,  so 
that  thread  packages  can  take  advantage  of  parallel  hardware.  Our  major  design  goals  were 
portability,  simplicity,  and  efficiency.  To  achieve  these  goals,  we  have  kept  the  MP  interface 
as small as possible, so that it can be easily ported to a variety of machines and operating
systems. Porting MP to a new machine involves implementing only a small number
of  system-dependent  routines.  The  interface  provides  enough  functionality  so  that  thread 
packages  may  be  portably  implemented  in  ML.  This  gives  us  flexibility  to  experiment  with 
different  sets  of  thread  primitives,  scheduling  policies,  etc.,  using  the  language’s  powerful 
abstraction  mechanisms.  We  use  a  range  of  mechanisms  to  interface  ML  code  with  the 
SML/NJ  runtime  system  both  efficiently  and  cleanly. 

The MP interface has been ported to the Silicon Graphics (SGI) 4D/380S, the Omron
Luna 88k, and the Sequent Symmetry multiprocessors.¹ Both the Omron and Sequent
implementations were completed within a week, a testament to the portability of the
interface. Trivial uniprocessor implementations of MP exist for all systems that run SML/NJ.
A portable version of ML Threads [14], extended with “safe refs” as suggested by Appel and
Tolmach [45], has been built on top of the MP interface and can be used as a basis for
experimenting with multiprocessing.

In  this  paper,  we  first  show  how  SML/NJ  provides  appropriate  mechanisms  for  building 
uniprocessor  thread  packages.  We  then  present  our  multiprocessor  interface  and  show  how  a 
parallel  thread  package  can  be  implemented  on  top  of  it.  We  discuss  the  system-independent 
implementation routines and the system-dependent routines for the SGI multiprocessor. We
close  with  performance  results  and  some  open  problems. 

We  will  assume  that  the  reader  has  a  basic  knowledge  of  SML/NJ.  In  particular,  the  reader 
should  understand  first-class  continuations  in  order  to  follow  the  examples.  Some  knowledge 
of  operating  systems  and  their  concurrency  features  will  also  be  helpful. 

¹ The Sequent port was done by Lorenz Huelsbergen of the University of Wisconsin using the sml2c
compiler [44].
2 Terminology

A machine consists of some number of independent physical processors that share a common
memory.  The  shared  memory  might  not  be  sequentially  consistent;  i.e.,  for  a  processor  to 
guarantee  that  the  result  of  a  memory  operation  is  visible  to  other  processors,  it  may  need 
to  issue  explicit  instructions  in  addition  to  the  memory  operation  itself.  A  machine  that  has 
only  one  processor  is  called  a  uniprocessor,  while  machines  with  more  than  one  processor 
are  called  multiprocessors. 

A task (sometimes referred to as an address space) is an abstract unit of resource allocation
exported by the operating system. A task typically contains a view of a portion of the
machine’s memory, some number of kernel threads, and other resources (e.g., file descriptors)
that the OS has granted the task.

A  thread  is  an  abstract  agent  of  computation  that  provides  the  illusion  of  a  virtual  processor. 
We  say  that  the  threads  of  a  program  multiplex  the  control  of  the  program.  Threads  and 
their  associated  operations  may  be  provided  by  the  operating  system  (kernel  threads)  or  by 
a  set  of  user-level  routines  (user-level  threads). 

A  kernel  thread  is  a  thread  that  is  provided  and  managed  by  the  OS.  Each  kernel  thread 
belongs  to  at  most  one  task  and  can  be  scheduled  to  execute  on  any  of  the  processors  at  any 
time.  In  particular,  the  OS  may  choose  to  run  the  threads  of  a  task  in  parallel.  Additionally, 
the  OS  has  the  ability  to  preempt  (i.e.,  suspend)  a  kernel  thread  at  any  time.  Each  of  a 
task’s  kernel  threads  may  access  any  memory  location  of  which  the  task  has  a  view. 

A  user-level  thread  is  a  thread  that  is  provided  and  managed  by  a  user-level  set  of  routines. 
User-level  threads  can  provide  much  better  performance  and  flexibility  in  scheduling  than 
kernel  threads  [2, 16, 30,  42, 47],  because  thread  operations  do  not  require  expensive  system 
calls.  However,  kernel  intervention  is  typically  still  required  to  obtain  access  to  multiple 
processors  and  to  manage  I/O.  Therefore,  user-level  threads  are  often  multiplexed  on  top 
of  kernel  threads;  typically,  there  is  one  kernel  thread  per  physical  processor. 

Figure 1 shows the typical relationship between processors, tasks, kernel threads, and
user-level threads. In SML/NJ, operating system services are provided via a runtime system
written in C; most services are exported to the ML environment as functions (see
Section 5.1). We have a choice between implementing user-level threads in C and exporting
them to ML, or exporting an interface to kernel threads and implementing user-level threads
directly in ML. Since ML provides excellent facilities for implementing threads in a
uniprocessor environment, as described in Section 3, it is natural to choose the latter alternative;
the resulting interface is described in Section 4.

3  An  SML/NJ  User-Level  Thread  Package 

Standard ML is a mostly functional programming language that provides first-class
functions (closures), compile-time typing, polymorphism, exceptions, garbage collection, and a
powerful module facility. In addition, the language has a complete formal semantic
specification [33].

The Standard ML of New Jersey implementation [9, 7] supports type-safe, first-class
continuations [17] and provides asynchronous exception handling facilities in the form of signal
handlers [38]. These extensions provide everything that is needed to build complete,
uniprocessor, user-level thread packages in the language.

Figure 1: Relationship of Tasks, Kernel Threads, and User-Level Threads

    signature QUEUE =
    sig
      type 'a queue
      val create : unit -> 'a queue
      val enq : 'a queue -> 'a -> unit
      exception Deq
      val deq : 'a queue -> 'a
    end

    signature THREAD =
    sig
      val fork : (unit -> unit) -> unit
    end

Figure 2: Queue and Thread Signatures

This  section  describes  the  implementation  of  a  simple  thread  package,  similar  to  the  core 
of  existing  systems  such  as  CML  [39]  and  ML  Threads  [14]. 

3.1  Thread  Management 

Wand  [46]  is  credited  with  showing  how  concurrency  can  be  simulated  using  continuations. 
Figure  3  shows  how  a  very  simple  user-level  thread  mechanism  can  be  implemented  in 
SML/NJ. The thread module is presented as a functor, parameterized by a QUEUE structure,
whose  implementation  is  straightforward  and  is  not  shown  here.  The  THREAD  signature 
exports  just  the  function  fork  to  create  new  threads.  The  QUEUE  and  THREAD  signatures 
are  shown  in  Figure  2. 

Each  waiting  thread  is  represented  by  a  unit-accepting  continuation.  A  queue  (ready) 
of  continuations  is  used  to  store  threads  that  are  not  currently  running.  When  given  a 
function  child,  fork  captures  the  current  continuation  using  callcc  and  binds  it  to  parent. 
The  parent  continuation  is  placed  on  the  ready  queue,  and  child  is  executed.  When 
child completes, a continuation is dequeued from ready and invoked using throw, causing
execution  to  continue  in  the  context  of  another  thread. 
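
As an illustration only, the ready-queue discipline described above can be mimicked outside ML. The sketch below is in Python, which has no first-class continuations, so generators stand in for the `unit cont` values; this is a substituted technique, not the callcc-based mechanism of Figure 3. Unlike the ML `fork`, which runs the child at once and enqueues the parent, this `fork` simply enqueues the child. The names `run`, `worker`, and `log` are hypothetical.

```python
from collections import deque

ready = deque()   # suspended threads; generators stand in for continuations
log = []          # records the order in which threads execute

def fork(child):
    """Create a thread from a generator function and make it runnable."""
    ready.append(child())

def run():
    """Resume ready threads in FIFO order until none remain."""
    while ready:
        thread = ready.popleft()   # plays the role of getnext ()
        try:
            next(thread)           # plays the role of throw k ()
        except StopIteration:
            continue               # thread finished; pick another one
        ready.append(thread)       # plays the role of reschedule k

def worker(name, steps):
    """Build a thread body that logs `steps` entries, yielding after each."""
    def body():
        for i in range(steps):
            log.append((name, i))
            yield                  # voluntary switch point
    return body

fork(worker("a", 2))
fork(worker("b", 2))
run()
```

The two workers interleave one step at a time because each one is placed at the back of the queue after it yields.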

Threads  inherit  the  environment  of  their  parent,  which  may  contain  both  mutable  and 
immutable  identifiers.  The  mutable  identifiers  form  a  shared  memory:  any  ref  cell  or  array 
accessible  from  two  threads  is  implicitly  shared  by  them.  A  variety  of  synchronization  and 
communication  mechanisms  can  be  implemented  on  top  of  this  shared  memory.  Two  such 
mechanisms,  mutex  locks  and  pipes,  are  described  in  Appendices  A  and  B. 

To  prevent  a  compute-bound  thread  from  monopolizing  the  processor,  timer  alarm  signals 
are used to trigger thread preemption. The call to setitimer causes the signal SIGALRM
to  be  delivered  to  the  process  every  50  msec.  The  call  to  setHandler  installs  the  function 
switch  as  a  handler  for  the  signal.  When  an  alarm  signal  is  delivered  to  the  process,  switch 
    functor CoThread (Queue : QUEUE) : THREAD =
    struct

      val ready : (unit cont) Queue.queue = Queue.create ()

      val atomic_flag = ref false
      fun atomic () = !atomic_flag
      fun enter_atomic () = atomic_flag := true
      fun leave_atomic () = atomic_flag := false

      fun atomically f x =
        (enter_atomic ();
         let val r = f x
         in
           leave_atomic ();
           r
         end) handle exn => (leave_atomic ();
                             raise exn)

      val reschedule = atomically (Queue.enq ready)
      fun getnext () = atomically Queue.deq ready

      fun fork (child : unit -> unit) =
        callcc (fn parent =>
          (reschedule parent;
           child ();
           throw (getnext ()) ()))

      fun switch (_,k : unit cont) =
        if atomic () then k
        else (reschedule k;
              getnext ())

      local
        open System.Signals System.Timer
             System.Unsafe.CInterface
        val t = TIME {sec = 0, usec=50000}
      in
        val _ = (setHandler(SIGALRM,SOME switch);
                 setitimer(0,t,t))
      end

    end


Figure  3:  Implementing  Threads  in  SML/NJ 
is given the current continuation (k) of the process. The handler is expected to return a
continuation to be invoked. Ordinarily, switch enqueues k onto the ready queue, and
dequeues and returns another thread’s continuation to be invoked.

The enq and deq operations must be atomic with respect to preemption; otherwise, the
ready queue might end up in an inconsistent state. A flag (atomic_flag) is used to indicate
when an “atomic” operation is in progress; switch preempts the current function only if
atomic_flag is clear. If atomic_flag is set when the alarm goes off, switch simply returns
the continuation that it was given, and the computation proceeds as if it had not been
interrupted.
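
The decision made by the handler can be isolated from the signal machinery. The following Python sketch models only that decision, under the assumption that continuations can be treated as opaque values (plain strings here); the `Scheduler` class and its field names are hypothetical and no real signal is delivered.

```python
from collections import deque

class Scheduler:
    """Toy model of the preemption decision made by the switch handler."""

    def __init__(self):
        self.ready = deque()       # continuations of threads that are not running
        self.atomic_flag = False   # True while an "atomic" operation is in progress

    def switch(self, k):
        """Return the continuation that should run after the alarm."""
        if self.atomic_flag:
            return k                   # do not preempt: resume the interrupted code
        self.ready.append(k)           # reschedule k
        return self.ready.popleft()    # getnext ()

sched = Scheduler()
sched.ready.append("thread-2")

# Flag clear: the interrupted thread is queued and the next one is resumed.
resumed_normally = sched.switch("thread-1")

# Flag set: the handler hands back exactly the continuation it was given.
sched.atomic_flag = True
resumed_in_atomic = sched.switch("thread-2")
```

In the second call the ready queue is left untouched, which is the property that keeps a half-finished enq or deq from being observed.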

3.2  Performance  Considerations 

The  efficiency  of  this  thread  package  implementation  depends  directly  on  that  of  SML/NJ’s 
continuation  primitives.  Fortunately,  the  cost  of  creating  and  invoking  continuations  in 
SML/NJ  is  low  relative  to  overall  execution  speed,  comparable  with  the  cost  of  invoking 
an  ML  library  function.  There  are  several  reasons  for  this  efficiency.  First,  the  SML/NJ 
compiler  uses  an  intermediate  form  known  as  continuation-passing  style  that  makes  all 
continuations  in  a  program  explicit,  so  it  is  simple  for  the  compiler  to  expose  continuations 
to the user. Second, all closures (i.e., procedure frames) are allocated on the heap instead
of on the stack. Thus, continuation creation (via callcc) consists merely of allocating and
initializing  a  closure.  No  copying  from  a  stack  to  the  heap  needs  to  take  place.  Conversely, 
when  a  continuation  is  invoked  (via  throw),  no  copying  from  the  heap  back  to  the  stack  is 
necessary. 

On the other hand, SML/NJ’s use of heap allocation for closures may make ordinary
execution slower than it would be were conventional stack allocation used. Although Appel has
shown that heap allocation can compete with stack allocation for certain uniform memory
architectures [4], this is less likely to be true for systems with small first-level caches [24].
Therefore, while thread operations built on top of SML/NJ’s continuations should be
efficient relative to the rest of the ML computation, thread applications may not perform as
well overall under SML/NJ as under competing systems.

In  fact,  first-class  continuations  are  an  overly  general  mechanism  for  saving  thread  state. 
Since  any  waiting  thread  can  be  resumed  at  most  once,  it  is  permissible  to  destroy  the 
thread’s  saved  state  in  the  process  of  reinstating  it,  which  can  reduce  the  amount  of  copying 
required by a stack-based implementation [22].


4  The  MP  Interface 

Extending  SML/NJ  to  support  multiprocessing  involves  both  explicitly  adding  to  the  set 
of  language  primitives  and  implicitly  adapting  existing  primitives  to  the  multiprocessor 
environment.  Figure  4  shows  the  explicit  MP  multiprocessor  interface  in  the  form  of  an 
ML  signature.  This  signature  describes  the  abstract  set  of  services  available  to  an  MP  client, 
e.g.,  a  particular  thread  package.  The  main  abstraction  presented  by  the  MP  interface  is 
a  proc — a  language-level  view  of  a  kernel  thread  executing  on  a  physical  processor.  The 
interface  provides  operations  for  managing  procs  and  their  state.  Mutual  exclusion  among 
procs  is  provided  by  spin  locks. 
    signature MP =
    sig

      type proc_datum
      datatype proc_state = PS of (unit cont * proc_datum)

      exception Acquire_Proc
      val acquire_proc : proc_state -> unit
      val release_proc : unit -> 'a
      val active_procs : unit -> int

      val initial_datum : proc_datum
      val get_datum : unit -> proc_datum
      val set_datum : proc_datum -> unit

      type spin_lock
      exception Spin_Lock
      val spin_lock : unit -> spin_lock
      val try_lock : spin_lock -> bool
      val lock : spin_lock -> unit
      val unlock : spin_lock -> unit

    end


Figure  4:  The  MP  Interface 


4.1  Proc  Management 

The proc_state datatype represents the state of a proc. It consists of a continuation and
a client-defined proc_datum. acquire_proc attempts to acquire a new proc for the task. It
takes as an argument a proc_state and immediately returns. If the operation is successful,
the new proc is given the continuation of the proc_state to execute in parallel with the
proc that called acquire_proc. In addition, the proc_datum portion of the proc_state is
installed, as described below.

Typically,  MP  is  implemented  so  that  the  number  of  available  procs  is  equal  to  the  number 
of physical processors on the machine; after this limit is reached, calls to acquire_proc will
raise the exception Acquire_Proc. However, the number of physical processors available to
an  ML  program  typically  changes  during  a  computation  as  a  result  of  activity  by  other  users 
and  by  the  operating  system  itself.  Most  operating  systems  lack  a  mechanism  to  inform  user 
space  code  about  dynamic  changes  in  processor  availability.  Thus,  as  with  kernel  threads, 
the  correspondence  between  procs  and  available  physical  processors  is  only  an  approximate 
one. 

The operation release_proc can be used to give the calling proc back to MP. If there
is only one proc currently running and it calls release_proc, then the entire task exits,
which might be undesirable. To help the client avoid releasing the last proc, the function
active_procs returns an integer indicating how many active procs the task currently has.
release_proc is typically implemented by releasing the current physical processor to the
operating system, by either blocking in or exiting from the current kernel thread.




    signature MP_ARG =
    sig
      type proc_datum
      val initial_datum : proc_datum
    end

    functor MP (MP_Arg : MP_ARG) : MP =
    struct
      ...
    end


Figure  5:  The  MP  Functor 

Since acquire_proc and release_proc require communication with the operating system,
the client may wish to invoke them sparingly. To obtain good performance (at the
expense of other system users) a client typically calls acquire_proc repeatedly when it starts
up, acquiring as many procs as possible, and holds on to them for the duration of the
computation.
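
The start-up pattern of acquiring procs until the request fails can be sketched as follows. This is a Python illustration under stated assumptions: `FakeMP` is a hypothetical toy that only counts procs and starts no kernel threads, `AcquireProc` stands in for the exception Acquire_Proc, and `grab_all` is an invented helper name.

```python
class AcquireProc(Exception):
    """Stand-in for the MP exception Acquire_Proc."""

class FakeMP:
    """Toy bookkeeping model of proc acquisition; not a real MP implementation."""

    def __init__(self, max_procs):
        self.max_procs = max_procs
        self.active = 1                 # the initial proc is already running

    def acquire_proc(self, state):
        if self.active >= self.max_procs:
            raise AcquireProc()         # limit reached, as described in the text
        self.active += 1                # a real MP would start `state` on a new proc

    def active_procs(self):
        return self.active

def grab_all(mp, make_state):
    """Call acquire_proc until it fails; return how many procs were obtained."""
    acquired = 0
    while True:
        try:
            mp.acquire_proc(make_state())
        except AcquireProc:
            return acquired
        acquired += 1

mp = FakeMP(max_procs=4)
extra = grab_all(mp, lambda: ("idle-loop continuation", False))
```

A client following this pattern would typically give each new proc a continuation that loops pulling work from the ready queue, so that the procs remain held for the whole computation.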

4.2  Per-Proc  Data 

Often, a proc needs some small amount of private state. For example, consider the flag
atomic_flag of Figure 3. This flag indicates whether the single proc can be interrupted or
not. If we have multiple procs, then we need to have multiple flags, since some procs might
be executing an atomic operation at the same time that other procs are not.

The MP interface provides a single, programmer-defined proc_datum for each proc. The
operations get_datum and set_datum allow a proc to read and write its private datum. The
datum value for the initial proc is given by initial_datum. As an example, the atomic_flag
for a proc can be stored directly in its proc_datum. If clients need to store more complex
pieces of state, they can treat proc_datum as a record or array. To support arbitrary datum
values in a type-safe manner, type proc_datum and the initial_datum value are defined as
arguments to the MP functor, as shown in Figure 5.
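
Per-proc data behaves much like thread-local storage in other systems. The Python sketch below uses `threading.local` as an analogy, with Python threads standing in for procs. One simplification is assumed: every thread that has not called `set_datum` sees `INITIAL_DATUM`, whereas in MP a newly acquired proc receives its datum from the proc_state passed to acquire_proc. The names `_proc`, `INITIAL_DATUM`, and `worker` are hypothetical.

```python
import threading

_proc = threading.local()   # one independent slot per thread (our stand-in for a proc)
INITIAL_DATUM = False       # plays the role of initial_datum

def get_datum():
    """Read the calling thread's private datum."""
    return getattr(_proc, "datum", INITIAL_DATUM)

def set_datum(value):
    """Overwrite the calling thread's private datum."""
    _proc.datum = value

results = {}

def worker(name, flag):
    set_datum(flag)              # visible only to this thread
    results[name] = get_datum()

threads = [threading.Thread(target=worker, args=(name, flag))
           for name, flag in [("p1", True), ("p2", False)]]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_datum = get_datum()         # unaffected by what the workers stored
```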

4.3  Memory  Management  and  Mutual  Exclusion 

All  live  heap  memory  is  implicitly  shared  among  all  procs;  in  particular,  a  proc  can  freely 
read  or  write  into  heap  locations  allocated  by  another  proc.  In  ML,  only  specially  declared 
ref  and  array  variables  can  be  updated  after  being  allocated,  so  all  communication  between 
procs  must  be  via  such  variables.  If  two  procs  perform  conflicting  operations  on  the  same 
data,  the  result  is  unpredictable — it  depends  on  details  of  scheduling,  relative  execution 
speed,  compiler  code  generation,  and  hardware  architecture.  One  way  to  solve  these  race 
condition  problems  is  to  use  some  form  of  mutual  exclusion  locks. 

We  provide  spin  locks  to  solve  race  condition  problems  at  the  MP  level.  Spin  locks  are 
meant  to  be  held  for  very  short  periods  of  time — so  short,  in  fact,  that  it  is  more  efficient  to 
spin (i.e., wait) for a lock than to do some sort of context-switch. Every potentially shared,
mutable memory location should be protected by such a lock, to guarantee that only one
proc at a time can access it.

The  current  MP  specification  does  not  address  the  underlying  memory  consistency  model 
provided  by  the  hardware  architecture;  maintaining  the  desired  degree  of  consistency  is  the 
responsibility  of  the  MP  client.  For  the  platforms  on  which  it  is  currently  implemented,  MP 
does expose enough of the low-level architecture to allow clients to control consistency; on
the SGI, for example, wrapping all shared memory accesses (including aligned single-word
writes) with spin lock operations will suffice to give the illusion of sequential consistency.
Defining  a  uniform  higher-level  model  remains  an  important  goal  for  future  work. 

We chose not to provide “heavier-weight” synchronization primitives for two reasons. First,
most hardware today provides primitives that directly support spin locks, such as an atomic
test-and-set instruction. Second, other synchronization constructs such as mutex locks,
reader/writer  locks,  semaphores,  pipes,  channels,  etc.,  can  be  easily  implemented  using  spin 
locks  in  conjunction  with  first-class  continuations;  see  Appendices  A  and  B  for  examples. 

The operation spin_lock attempts to return a fresh spin lock in the unlocked state.
Implementations of MP are expected to implement these locks as efficiently as possible, e.g.,
using hardware support. Since some machines may provide only a limited number of
hardware locks,² the MP implementation is also permitted to run out of locks, in which case
exception Spin_Lock is raised by the next spin_lock operation. Thus, thread packages
are required to cope with the possibility of running out of locks by multiplexing them as
necessary.

The  operation  try_lock  attempts  to  lock  the  specified  spin  lock  and  immediately  returns  a 
bool  indicating  the  success  of  the  operation.  If  the  operation  is  successful,  the  proc  is  said 
to  “own  the  lock”.  At  most  one  proc  can  own  a  given  spin  lock  at  any  time.  The  spin  lock 
can  be  released  by  calling  the  unlock  operation;  this  need  not  be  done  by  the  proc  that  set 
the  lock. 

The lock operation is similar to try_lock except that it does not return until the lock is
successfully obtained. It is functionally equivalent to the following routine:

    fun lock sl = while not (try_lock sl) do ()  (* spin *)

lock is provided in the interface since some operating systems may provide a more efficient
spin than the one shown above (e.g., by using backoff techniques [1]).
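
To make the backoff idea concrete, here is a minimal Python sketch of a spin lock whose `lock` is built from `try_lock` with exponential backoff. It is an illustration, not the MP implementation: a `threading.Lock` acquired in non-blocking mode stands in for the hardware test-and-set bit, and the delay constants are arbitrary assumptions.

```python
import threading
import time

class SpinLock:
    """Spin lock with the try_lock / lock / unlock operations of the MP signature."""

    def __init__(self):
        self._bit = threading.Lock()   # used only as an atomic test-and-set bit

    def try_lock(self):
        """Attempt to take the lock; return immediately with the outcome."""
        return self._bit.acquire(blocking=False)

    def unlock(self):
        """Release the lock; the caller need not be the thread that took it."""
        self._bit.release()

    def lock(self, max_delay=0.001):
        """Spin until the lock is obtained, doubling the pause after each failure."""
        delay = 1e-6
        while not self.try_lock():
            time.sleep(delay)
            delay = min(2 * delay, max_delay)

state = {"count": 0}
sl = SpinLock()

def bump(n):
    for _ in range(n):
        sl.lock()
        state["count"] += 1   # critical section
        sl.unlock()

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Backing off reduces contention on the lock word while a proc waits, at the cost of a slightly later acquisition once the lock becomes free.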

4.4  I/O 

I/O presents at least two problems for user-level thread packages. The first affects both
uniprocessor and multiprocessor implementations. When a user-level thread performs some
blocking I/O operation, it should not block the proc on which it is running indefinitely if
there are other user-level threads that need to run. A thread blocked while attempting I/O
is similar to a compute-bound thread, and it can be dealt with in the same way, by using
UNIX timer alarm signals to preempt the thread. In the existing SML/NJ implementation,
I/O and signal handling are coordinated to support this preemption technique [38]. Before
a blocking I/O operation is performed (e.g., read from the keyboard), a UNIX select call is
made. The select may be interrupted by a signal. If the select completes without being
interrupted, indicating that the operation may be completed without blocking, signals are
masked and the actual operation is done. If a signal occurs during the select, the current
continuation is “backed-up” to retry the entire operation, so it appears to the signal handler
as though the signal occurred before the blocking I/O operation was initiated. Thus, no
proc will be blocked indefinitely on an I/O operation.

² In fact, the SGI multiprocessor is the only machine we have worked with that limits the number of
hardware locks.

A  second  problem  affects  multiprocessor  implementations:  two  processors  may  perform  I/O 
operations  simultaneously,  possibly  accessing  the  same  runtime  data  structures.  At  present, 
MP  takes  no  specific  steps  to  prevent  such  conflicts;  the  client  is  responsible  for  protecting 
the  primitives  with  suitable  mutual  exclusion  locks. 
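
The select-before-read protocol described in this section can be approximated in a few lines. The Python sketch below assumes a UNIX-like system and substitutes a zero timeout for the signal interruption used by the SML/NJ runtime, so it shows only the "check readiness, then perform the operation" half of the protocol; `read_if_ready` is a hypothetical helper name.

```python
import os
import select

def read_if_ready(fd, nbytes, timeout):
    """Return data if fd becomes readable within `timeout` seconds, else None."""
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return None               # the read would block: let another thread run
    return os.read(fd, nbytes)    # safe: select reported that data is available

r, w = os.pipe()
first = read_if_ready(r, 16, 0.0)    # nothing has been written yet
os.write(w, b"hello")
second = read_if_ready(r, 16, 0.0)   # data is now pending
os.close(r)
os.close(w)
```

A thread package would treat a `None` result the way the runtime treats an interrupted select: the operation is retried later from the beginning.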

4.5  Signals 

We use the existing SML/NJ signal interface [38], adding suitable conventions for multiprocessing.
SML/NJ exposes only a subset of the UNIX signals to the ML user; of these,
we  are  primarily  concerned  with  providing  suitable  treatment  for  SIGALRM  (timer  alarm) 
and  SIGINT  (keyboard  interrupt)  signals. 

Signal  handlers  are  installed  via  a  call  to  setHandler.  The  signal  handlers  are  per-task 
resources. That is, each proc within a task shares the same signal handling function. However,
each proc may choose to mask or unmask signals independently via the maskSignals
operation.  When  a  signal  occurs,  it  is  delivered  to  each  proc.  If  the  proc  has  masked  its 
signals,  then  the  delivery  will  be  delayed  until  the  proc  unmasks  signals. 

There is no facility in the interface for procs to signal one another, since they can communicate
through their shared address space. MP does not support asynchronous alerts or
control  operations  by  one  proc  acting  on  another,  although  these  may  be  simulated  using 
polling  in  the  target  proc.  The  timer  alarm  can  be  used  to  issue  the  poll  at  appropriate 
intervals. 

4.6  An  MP  Thread  Package 

In this section, we show how to extend the implementation of the simple thread package
from  Section  3  so  that  user-level  threads  can  run  in  parallel  using  the  MP  interface. 

The ready queue must be protected by spin locks to avoid race conditions between competing
procs. Figure 6 shows a functor that takes a queue implementation, and provides
queues  that  are  safe  with  respect  to  multiple  procs. 
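
The essential point of the safe queue is that the lock is released on every exit path, including when the underlying queue raises an exception. The Python sketch below illustrates that wrapper pattern only; a `threading.Lock` stands in for an MP spin lock, and the class and method names are hypothetical analogues of the functor in Figure 6.

```python
import threading
from collections import deque

class Deq(Exception):
    """Raised when deq is applied to an empty queue."""

class SafeQueue:
    """Queue whose operations each run while holding a per-queue lock."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()   # stand-in for an MP spin lock

    def _with_lock(self, f, *args):
        self._lock.acquire()
        try:
            return f(*args)
        finally:
            self._lock.release()        # runs on normal return and on exception

    def enq(self, x):
        self._with_lock(self._items.append, x)

    def deq(self):
        def take():
            if not self._items:
                raise Deq()
            return self._items.popleft()
        return self._with_lock(take)

q = SafeQueue()
q.enq(1)
first = q.deq()
try:
    q.deq()                  # the queue is empty, so this raises Deq
    drained = False
except Deq:
    drained = True
lock_free_after_error = not q._lock.locked()
```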

Figure 7 shows a functor that takes a safe queue implementation and an MP implementation
in which the per-proc data are bool values and the initial datum is false, and produces a
structure that matches THREAD. The ready queue is a safe unit continuation queue. Thus,
enqueuing or dequeuing a thread is an atomic operation with respect to multiple procs. The
atomic_flag is replaced by the per-proc datum. As before, the switch signal handler will
only allow a context switch to occur when the flag is not set.

For  simplicity,  the  implementation  shown  here  handles  proc  acquisition  and  release  naively; 
it  holds  procs  only  as  long  as  they’re  immediately  useful  rather  than  stockpiling  them  as 
suggested  in  Section  4.1.  The  fork  function,  when  given  a  function  to  fork,  attempts  to 
    functor Safe_Queue (structure MP : MP
                        structure Queue : QUEUE) : QUEUE =
    struct

      type 'a queue = ('a Queue.queue * MP.spin_lock)
      fun create () = (Queue.create (), MP.spin_lock ())

      fun with_lock l f =
        (fn x => (MP.lock l;
                  let val r = f x
                  in
                    MP.unlock l;
                    r
                  end) handle exn => (MP.unlock l;
                                      raise exn))

      fun enq (q,l) = with_lock l (Queue.enq q)

      exception Deq = Queue.Deq

      fun deq (q,l) = (with_lock l Queue.deq) q

    end


Figure  6:  Safe  Queues 

acquire a proc to run the parent’s continuation. If the call to acquire_proc fails, the parent
is  rescheduled  onto  the  ready  queue  and  the  child  function  is  evaluated.  When  the  child 
completes,  a  continuation  is  atomically  dequeued  from  ready  using  getnext  and  invoked. 
If  the  exception  Deq  is  raised,  there  are  no  more  threads  to  run,  so  the  calling  proc  releases 
itself. 

Figure  8  shows  how  the  system  may  be  linked,  given  the  MP  functor  and  a  structure  Queue. 


5  MP  Generic  Implementation  Details 

The  implementation  of  the  MP  interface  is  divided  into  a  generic  system-independent  layer, 
and  a  system-dependent  layer.  The  generic  layer  makes  up  the  bulk  of  an  implementation, 
allowing  the  interface  to  be  ported  easily. 

5.1  Interfacing  ML  code  to  the  Outside  World 

To keep pace with ongoing changes to the SML/NJ system, we tried to disturb the existing
implementation as little as possible while adding MP support. Internally,
SML/NJ supports a range of mechanisms for access to the underlying hardware and operating
system [6, 7]. Each mechanism involves a different tradeoff between execution overhead
and portability.


Primops: The compiler generates generic machine code, which is then translated into
machine-specific instruction sequences. The generic machine model includes general-purpose
    functor MP_Thread (structure MP : MP
                       structure SafeQ : QUEUE
                       sharing type MP.proc_datum = bool) =
    struct

      structure SafeQ = SafeQ

      val ready : (unit cont) SafeQ.queue = SafeQ.create ()

      fun atomic () = MP.get_datum ()
      fun enter_atomic () = MP.set_datum true
      fun leave_atomic () = MP.set_datum false

      fun atomically f =
        (fn x => (enter_atomic ();
                  let val r = f x
                  in
                    leave_atomic ();
                    r
                  end) handle exn => (leave_atomic ();
                                      raise exn))

      val reschedule = atomically (SafeQ.enq ready)
      fun getnext () = (atomically SafeQ.deq) ready
                       handle SafeQ.Deq => MP.release_proc ()

      fun fork child =
        (callcc (fn parent =>
           (MP.acquire_proc (MP.PS(parent,true))
              handle MP.Acquire_Proc =>
                reschedule parent;
            child ();
            throw (getnext ()) ())))

      fun switch (_,k : unit cont) =
        if atomic () then k
        else (reschedule k;
              getnext ())

      local
        open System.Signals System.Timer
             System.Unsafe.CInterface
        val t = TIME {sec = 0, usec=60000}
      in
        val _ = (setHandler(SIGALRM,SOME switch);
                 setitimer(0,t,t))
      end

    end


Figure  7:  MP  Threads 
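
As a rough illustration of the atomically wrapper and the preemption handler in Figure 7,
the following Python sketch mimics their control flow. It is only an analogy: the
proc_datum dictionary stands in for the per-proc datum, the reschedule and getnext
arguments are hypothetical callbacks, and continuations are replaced by opaque values.

```python
# Stand-in for the per-proc datum; in MP this is per-proc state, not a global.
proc_datum = {"atomic": False}

def atomically(f):
    """Wrap f so the atomic flag is set while f runs and cleared afterwards,
    even if f raises (mirroring the exception handler in Figure 7)."""
    def wrapped(x):
        proc_datum["atomic"] = True
        try:
            return f(x)
        finally:
            proc_datum["atomic"] = False
    return wrapped

def switch(k, reschedule, getnext):
    """Preemption handler: resume k if the proc is inside an atomic region,
    otherwise requeue k and pick the next ready thread."""
    if proc_datum["atomic"]:
        return k
    reschedule(k)
    return getnext()
```

The point of the flag is that a timer signal arriving while the ready queue is being
manipulated simply resumes the interrupted thread instead of re-entering the queue code.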
structure MP_Arg : MP_ARG =
struct
  type proc_datum = bool
  val initial_datum = false
end

structure MP = MP(MP_Arg)

structure SafeQ = Safe_Queue(structure MP = MP
                             structure Queue = Queue)

structure Thread : THREAD = MP_Thread(structure MP = MP
                                      structure SafeQ = SafeQ)


Figure  8:  Linking  the  System 

registers and transfer operations, and also a set of primitive operators (primops) for
performing specific arithmetic and logical operations and other specialized tasks, such as
callcc. Typically, each primop translates into a different sequence of native machine
instructions for each supported architecture. It is relatively easy to add a new primop,
provided that a suitable implementation is available for each architecture and that the
implementation depends only on the architecture (and not, e.g., the operating system).

Assembly  functions:  The  code  sequences  for  some  common  operations,  such  as  string 
allocation  and  certain  arithmetic  functions,  are  too  lengthy  to  generate  in  line.  Instead, 
they  are  provided  as  assembly-level  functions  that  can  be  invoked  directly  from  generated 
ML  code.  Invoking  such  a  function  is  similar  in  cost  to  calling  an  ML  library  function;  in 
particular,  a  return  closure  for  the  function  call  must  be  constructed.  Assembly  functions 
can  have  variant  implementations  for  different  architectures  and  different  operating  systems. 


C  functions:  The  runtime  system  for  SML/NJ  is  written  in  C  and  provides  a  coroutine 
interface  to  ML  code.  When  ML  requires  a  runtime  service,  such  as  garbage  collection, 
it  sets  a  global  variable  request  to  a  value  indicating  which  service  is  desired,  saves  its 
register  set  in  a  global  state  vector,  loads  the  C  registers  from  the  C  stack  and  begins  to 
execute  the  C  code  that  will  provide  the  service.  When  the  runtime  service  is  complete, 
the  C  registers  are  saved  on  the  C  stack,  the  ML  registers  re-loaded,  and  execution  of  ML 
code  continues.  Two  assembly  language  routines,  saveregs  and  restoreregs,  handle  the 
machine-specific  task  of  crossing  this  ML/C  boundary. 

The most important runtime services are garbage collection (see Section 5.5) and signal
processing, which have their own request values. Other services, such as I/O, are provided
via named C functions. New functions can be added easily at link time; in fact, function
names are not bound to code addresses until runtime.
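
The late binding of named functions can be pictured with a small Python sketch. This is an
analogy only: the real runtime is written in C, and the registry, register, and call_c
names below are hypothetical.

```python
# Hypothetical registry mapping a service name to its implementation.
c_functions = {}

def register(name, fn):
    """Add (or replace) a named runtime function."""
    c_functions[name] = fn

def call_c(name, *args):
    """Resolve the name at call time, so binding is deferred until use."""
    try:
        fn = c_functions[name]
    except KeyError:
        raise LookupError("no runtime function named " + repr(name))
    return fn(*args)

register("add", lambda a, b: a + b)
```

Because the lookup happens on every call, a function registered later under the same name
takes effect without touching the callers.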
5.2 Proc Implementation

Each proc is allowed to execute both ML and C code. The obvious alternative, restricting
C execution to a single server thread, would introduce unnecessary synchronization. Each
proc is given an MLState vector that holds per-proc copies of runtime flags and values.
While executing ML code, a pointer to the state vector sits on top of the C stack. When
crossing from ML to C, the pointer is popped off the stack and passed as an argument
to saveregs. This routine passes the state pointer to the runtime service routine as well.
When the service is complete, the pointer is passed to restoreregs, which places it on top
of the stack.¹

In a C signal handler, it is difficult to recover the state pointer from the stack, since
a new frame is pushed on the stack. Thus, each state vector contains the kernel thread id
of the proc that owns it. While executing a C signal handler, the proc calls the
system-dependent routine ml_getpid to determine its identity. It then walks through the list
of state vectors until it finds the one with its id. While slow, this method is portable.
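
The lookup just described amounts to a linear scan. A minimal Python sketch, assuming a
hypothetical MLState class that records only the owning id:

```python
class MLState:
    """Hypothetical per-proc state vector; only the owner id is modelled."""
    def __init__(self, owner_id):
        self.owner_id = owner_id

def find_state(states, my_id):
    """Linear scan for the state vector owned by my_id, as a signal handler
    would do after asking the system for its own id."""
    for state in states:
        if state.owner_id == my_id:
            return state
    return None
```

The cost grows with the number of procs, which is bounded by a small constant, so the scan
is slow only relative to reading a pointer off the stack.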

Implementing acquire_proc and release_proc requires access to the operating system,
so these routines are implemented as C functions. Since the ML runtime system must
allocate data structures for each process, a compile-time constant determines the maximum
number of procs that the runtime system can provide. When a client calls acquire_proc,
the runtime system first attempts to find an available MLState vector and initializes it
appropriately. If the blocked pool, described below, is empty, the runtime system invokes
a system-dependent routine, new_proc, which in turn calls the operating system, passing
the continuation to be executed and the state vector as arguments. The operating system
creates a new kernel thread and starts it executing the continuation.

When a proc calls release_proc, the runtime system puts the current kernel thread to sleep
using the system-dependent routine block, making the current physical processor available
for other system users. The blocked kernel thread is placed in a pool for subsequent reuse.
If the pool is non-empty when acquire_proc is called, the runtime system will remove a
sleeping thread from the pool and wake it, using the system-dependent routine unblock,
instead of asking the operating system for a new kernel thread. Under some operating
systems, if the runtime system acquired and released kernel threads whenever acquire_proc
and release_proc were called, it would rapidly run out of available thread (process) ids.
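
The pooling policy can be sketched as a toy Python model. All names here are hypothetical,
procs are represented by small integers, and raising RuntimeError merely stands in for the
failure of acquire_proc; no real kernel threads are created.

```python
class ProcPool:
    """Toy model of acquire_proc/release_proc with a pool of blocked procs."""
    def __init__(self, max_procs):
        self.max_procs = max_procs   # stands in for the compile-time limit
        self.created = 0             # kernel threads requested from the OS
        self.blocked = []            # sleeping kernel threads kept for reuse
        self.running = set()

    def acquire_proc(self):
        if self.blocked:                     # reuse: wake a sleeping thread
            proc = self.blocked.pop()
        elif self.created < self.max_procs:  # otherwise ask for a new one
            proc = self.created
            self.created += 1
        else:
            raise RuntimeError("no proc available")
        self.running.add(proc)
        return proc

    def release_proc(self, proc):
        self.running.remove(proc)            # put to sleep, do not destroy
        self.blocked.append(proc)
```

However many acquire and release calls are made, the model never requests more than
max_procs threads, which is the property that avoids exhausting thread ids.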

5.3  Per-Proc  Data  Implementation 

Since it is very important to access the proc_datum as efficiently as possible, the SML/NJ
generic machine model was modified to include a dedicated virtual register to hold the
proc_datum. Two primops corresponding to get_datum and set_datum were added to read
and write the register. On RISC machines that have 32 or more registers, like the
MIPS-based SGI, dedicating a register for per-proc data does not affect performance
[7, page 189], so the virtual register is implemented by an actual register. On
architectures with fewer registers (e.g., VAX), the datum is stored on the stack and
accessed indirectly through the stack pointer.

¹The idea of passing the state pointer around in the runtime system is due to Andrew Appel. An
OS-dependent solution to this problem was used in the SML/Mach runtime system [14].
5.4 Spin Lock Implementation

Each  spin  lock  is  represented  by  one  or  more  memory  locations.  On  some  machines,  any 
location  can  be  used;  exclusive  access  is  guaranteed  by  using  specially  interlocked  exchange 
or  test-and-set  instructions.  In  this  case,  spin  locks  can  be  allocated  in  the  heap,  and 
garbage  collected  in  the  usual  way.  Other  machines  provide  a  special  bank  of  memory  for 
the locks, and only a fixed maximum number of locks is available. In this case, we do not
bother  garbage  collecting  the  lock  memory,  since  the  MP  client  must  already  cope  with 
the  possibility  of  a  restricted  number  of  locks;  if  maximizing  the  number  of  available  locks 
proves  desirable,  modifying  the  collector  to  reclaim  unreachable  locks  should  not  be  too 
difficult. 

The spin lock operations try_lock and unlock should be as fast as possible. This suggests
adding new primops to the compiler. However, while primop implementation may depend
on the target architecture, SML/NJ currently has no provision for generating code based
on the target operating system. Therefore, we implement these operations using assembly
language routines.

5.5  Storage  Management 

SML/NJ  performs  heap  allocation  very  frequently  (approximately  one  word  per  every  3-7 
instructions  [7,  page  196]).  Allocation  is  in-lined  by  the  compiler,  making  it  quite  fast.  An 
allocation  consists  of  incrementing  a  top-of-heap  pointer  and  writing  the  values.  A  check 
for heap overflow is made at the entrance of each code tree;² if there is insufficient free space
to  contain  the  maximum  allocation  that  the  code  tree  might  perform,  a  trap  instruction 
is  used  to  enter  the  C  runtime  system  and  initiate  garbage  collection.  SML/NJ  uses  a 
two-generation,  copying  garbage  collector  [5]. 

In  adapting  this  system  to  a  multiprocessor,  it  is  important  to  avoid  proc  synchronization 
during  allocation;  this  requirement  precludes  allocating  into  a  global  heap.  Instead,  the 
entire  allocation  region  is  divided  into  per-proc  regions  at  startup;  each  proc  allocates  into 
its  own  region.  All  regions  can  be  read  or  updated  by  the  other  procs,  however.  When 
the allocation region is filled and a garbage collection (GC) is required, the procs are
synchronized,  the  collection  is  performed,  and  the  allocation  region  is  redivided.  Procs  in 
the  blocked  pool  are  given  a  small,  fixed-size  region.  The  rest  of  the  allocation  area  is 
divided  evenly  among  the  active  procs. 
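
The redivision rule can be written down in a few lines of Python. This is a sketch under
assumptions: sizes are in arbitrary units, the blocked procs are laid out first (the text
does not specify an order), and divide_region is a hypothetical name.

```python
def divide_region(total, active, blocked, blocked_size):
    """Return {proc: (start, size)}: each blocked proc gets blocked_size
    units, and the remainder is split evenly among the active procs."""
    if not active:
        raise ValueError("need at least one active proc")
    remaining = total - blocked_size * len(blocked)
    if remaining < 0:
        raise ValueError("region too small for the blocked procs")
    share = remaining // len(active)
    layout, start = {}, 0
    for proc in blocked:
        layout[proc] = (start, blocked_size)
        start += blocked_size
    for proc in active:
        layout[proc] = (start, share)
        start += share
    return layout
```

For example, dividing 1000 units among two active procs and one blocked proc that is given
100 units leaves each active proc with 450 units.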

In  practice,  procs,  and  hence  processors,  tend  to  allocate  at  different  rates.  When  redividing 
the  allocation  region  after  a  GC,  we  could  try  to  give  each  proc  a  region  roughly  proportional 
to  its  allocation  rate.  However,  the  previous  allocation  rate  of  a  proc  can  be  a  poor  predictor 
of  the  future  allocation  rate,  especially  if  procs  frequently  switch  context.  Since  garbage 
collections  could  be  a  serious  performance  bottleneck,  we  want  to  make  sure  that  all  of  the 
available  space  in  the  allocation  area  is  used  before  initiating  a  GC. 

To  address  this  problem,  we  further  divide  each  proc’s  region  into  chunks.  Initially  each 
proc  is  given  the  first  chunk  in  its  region  as  its  allocation  area.  When  a  proc’s  current  chunk 
does  not  have  enough  space  left,  the  proc  attempts  to  extend  its  current  chunk  by  acquiring 
the  next  contiguous  chunk  in  its  region.  This  prevents  the  proc  from  wasting  space  that  is 
left in the current chunk. If, however, there are no more chunks in the proc’s region, the
proc attempts to find and “steal” some free chunks at the end of another proc’s region. If
no more chunks are available, then the proc initiates a garbage collection.

²A code tree is an acyclic set of blocks with one entry point and one or more exits.

Setting  the  chunk  size  appropriately  involves  some  tradeoffs.  Setting  it  too  large  could 
keep  a  fast  allocator  from  finding  available  chunks  to  steal,  causing  garbage  collection  to 
be  initiated  more  frequently.  Setting  the  chunk  size  too  small  forces  the  procs  to  trap  and 
look  for  chunks  more  often. 
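
The chunk policy (extend contiguously, then steal from the end of another region, then
collect) can be sketched in Python as follows. This is a simplified model: chunks are
identified by index, one chunk is stolen at a time, the Region and get_chunk names are
hypothetical, and the locking a real multiprocessor implementation would need is omitted.

```python
class Region:
    """Chunks lo..hi-1 belong to one proc; 'next' is its first unused chunk."""
    def __init__(self, lo, hi):
        self.next = lo + 1   # chunk lo is handed out at startup
        self.hi = hi         # one past the last chunk still in the region

def get_chunk(regions, me):
    """Return a chunk index for proc 'me', or None if a GC is needed."""
    mine = regions[me]
    if mine.next < mine.hi:            # extend into the next contiguous chunk
        chunk = mine.next
        mine.next += 1
        return chunk
    for other, region in regions.items():
        if other != me and region.next < region.hi:
            region.hi -= 1             # steal from the end of another region
            return region.hi
    return None                        # nothing left anywhere: collect
```

Stealing from the end keeps the victim's own unused chunks contiguous with its current
chunk, so the victim can still extend in place.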

Our current implementation does sequential garbage collection. When one proc finds that
no more chunks are available, it sends a signal (SIGUSR1) to the other procs, using the
system-dependent routine signal_proc. These procs note the signal, set their heap-limit
pointer to a special value, and continue what they were doing. When they enter the next
code tree, and are thus in a clean state, the special value in their heap-limit pointer will
cause them to fault and enter the GC routine. When all the procs have checked in, one of
them collects all of the roots and does the actual garbage collection while the others block
(using the block routine). When the collection is done, the new allocation area is divided
among the procs and the blocked procs are resumed (using the unblock routine).
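
The check-in protocol can be modelled sequentially in Python. This sketch makes several
assumptions: the special heap-limit value is taken to be 0, procs are plain objects rather
than kernel threads, and the names Proc, request_gc, and all_checked_in are hypothetical.

```python
GC_SENTINEL = 0  # assumed "special value": every heap check fails against it

class Proc:
    def __init__(self, alloc_ptr, heap_limit):
        self.alloc_ptr = alloc_ptr
        self.heap_limit = heap_limit
        self.checked_in = False

    def on_signal(self):
        """Signal handler: poison the heap limit, then return to ML code."""
        self.heap_limit = GC_SENTINEL

    def enter_code_tree(self, needed):
        """Heap check at code-tree entry; True means 'trap into the GC'."""
        if self.alloc_ptr + needed > self.heap_limit:
            self.checked_in = True
        return self.checked_in

def request_gc(initiator, procs):
    """The proc that ran out of chunks checks in and signals the others."""
    initiator.checked_in = True
    for p in procs:
        if p is not initiator:
            p.on_signal()

def all_checked_in(procs):
    """The collection may begin only when this returns True."""
    return all(p.checked_in for p in procs)
```

The design choice illustrated here is that the signal handler does almost nothing; the
existing heap check at the next code-tree entry does the work of stopping each proc in a
clean state.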

The  synchronization  code  for  garbage  collection  is  by  far  the  most  complex  change  we  made 
to  the  SML/NJ  system.  Fortunately,  the  bulk  of  this  code  is  implemented  at  the  generic 
layer. 


6  The  SGI  System-Dependent  Layer 

This section describes the machine-dependent layer of the MP implementation for the
Silicon Graphics 4D/380S multiprocessor running version 3.3 of the IRIX System V operating
system. This machine uses one to eight 40MHz MIPS R3000 processors, and supports up to
256 MB of main memory. Each processor has a 64KB direct-mapped primary instruction
cache, a 64KB direct-mapped write-through primary data cache, and a 1MB secondary
snoopy data cache. Test-and-set locks are provided through special hardware, described in
more detail below.

6.1  Procs  and  Processes 

IRIX  does  not  provide  kernel  threads  per  se,  but  does  provide  the  ability  for  processes 
(single-threaded  tasks)  to  share  the  same  address  space.  Processes  that  share  an  address 
space  form  a  process  group.  We  map  each  MP  proc  to  a  distinct  IRIX  process.  The  sproc(2) 
system call is used to implement new_proc. The blockproc(2) and unblockproc(2) system
calls  are  used  to  implement  block  and  unblock,  respectively. 

One  advantage  of  using  processes  (as  opposed  to  kernel  threads)  as  procs  is  that  the  interface 
to  IRIX  signals  is  uniform;  thus,  no  SML/NJ  runtime  changes  were  needed  to  support  MP 
signals  at  the  system-dependent  layer  for  the  SGI. 

6.2  Spin  Locks 

The MIPS R3000 does not have a test-and-set instruction. However, the SGI provides a
limited number of hardware locks, implemented by a separate lock memory and bus that
is mapped to a special page of memory called a synchronization arena. Reading
from a location in the arena performs a test-and-set operation and returns the result of the
test. Writing an arbitrary value to the location clears the lock. A system call (usinit(3)) is
needed to place the page in the process group’s address space.

At  most  4096  locks  may  be  allocated  from  a  synchronization  arena.  The  runtime  system 
reserves  a  small  number  of  locks  for  its  own  use  and  exports  the  rest  to  MP.  The  runtime 
system  keeps  an  array  of  pointers  to  the  exportable  locks.  When  a  spin-lock  call  is  made, 
the  runtime  system  returns  one  such  pointer  as  the  result  of  the  operation.  This  extra 
indirection  is  needed  to  support  stand-alone  or  exported  images  of  programs,  since  the 
operating system does not guarantee to place the synchronization arena in the same virtual
address  range  for  each  invocation  of  a  program. 

It is possible to “trick” the SML/NJ compiler into generating the proper sequence of
instructions for the try_lock and unlock operations, since they are just reads and writes:

    fun try_lock (x : spin_lock) =
        System.Unsafe.boxed(!(!(System.Unsafe.cast x)))

    fun unlock (x : spin_lock) = (!(System.Unsafe.cast x)) := 0

The  boxed  predicate  returns  true  if  its  argument  is  0  (considered  as  a  machine  word).  This 
approach  allows  us  to  gain  the  benefits  of  inlining  without  having  to  add  a  new  primop,  at 
the  expense  of  introducing  machine  dependencies  at  the  ML  level. 
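
The read and write semantics of the synchronization arena can be imitated by a small Python
model. It is a sketch under assumptions: the value returned by a read is taken to be the
previous state of the lock (the text says only that the result of the test is returned),
the LockArena class is hypothetical, and the model is not itself atomic or thread-safe.

```python
class LockArena:
    """Toy model of a lock page: a read is a test-and-set, a write clears."""
    def __init__(self, size):
        self._held = [False] * size

    def read(self, i):
        """Set lock i and return its previous state (assumed polarity)."""
        was_held = self._held[i]
        self._held[i] = True
        return was_held

    def write(self, i, _value):
        """Writing any value releases lock i."""
        self._held[i] = False

def try_lock(arena, i):
    """True if the lock was free and is now held by the caller."""
    return not arena.read(i)

def unlock(arena, i):
    arena.write(i, 0)
```

Since acquiring and releasing are an ordinary load and store, no special instruction is
needed, which is what makes the in-line ML version possible.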

6.3  Performance 

In  this  section,  we  present  preliminary  measurements  for  the  SGI  implementation  of  MP 
and  a  simple  thread  package  built  on  top  of  it.  A  proper  performance  analysis  of  MP 
should  compare  it  with  other,  perhaps  less  portable,  mechanisms  for  implementing  threads 
on  multiprocessors. 

Figure 9 gives times (in μsec) for selected MP and thread primitive operations. Thread tests
used  an  ML  Threads  package  implemented  on  top  of  MP.  The  package’s  implementation  is 
similar  to  the  one  described  in  Figures  7  and  12;  mutex  locks  are  enhanced  to  record  the  id 
of  the  owning  thread  and  check  that  the  unlock  is  performed  by  the  owner,  but  association 
of  mutexes  to  references  remains  informal.  Spin  locks  are  implemented  using  the  in-line 
ML  code  shown  in  Section  6.2.  Thread  package  code  and  the  ML  portion  of  MP  code  are 
placed  in  the  same  compilation  unit  as  the  test  code  to  encourage  in-lining.  Tests  used 
100  MB  of  main  memory;  all  times  include  garbage  collection.  Figures  presented  are  the 
averages  over  thousands  of  trials;  loops  were  unrolled  to  minimize  overhead. 

Figure  10  gives  running  times  (in  seconds)  for  five  applications  running  on  an  8  processor 
SGI  using  100  Mbytes  of  main  memory.  The  same  thread  package  was  used  as  in  Figure  9. 
Each  reported  figure  is  the  average  of  at  least  three  trials. 

The benchmarks were:

•  allpairs: Floyd’s algorithm for computing the shortest paths between all pairs of
nodes of a 100 node graph [34].

•  mst: Computes the minimum spanning tree on 500 randomly distributed points using
Prim’s algorithm [34].
Primitive operation                                 Time (μsec)
--------------------------------------------------------------
Get and set proc_datum.                                    0.07
Lock and unlock spin lock.                                  5.7
Unsuccessfully try to lock spin lock.                      1.12
Lock and unlock mutex.                                     40.7
Unsuccessfully try to lock mutex.                          10.6
Fork null thread and wait for it to terminate.              287
Synchronize with another thread (“pingpong”).               232
Voluntarily yield control to another thread.               32.5
Acquire and release proc (using 8 processors).             80.2

Figure 9: Primitive Timings


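                            Processors Used
Prog           1      2      3      4      5      6      7      8
-----------------------------------------------------------------
mst            ?  30.25  23.11  22.79      ?  21.18  21.39  22.80
simple     58.34  37.12  32.61  32.17  32.91  33.40  33.65  34.24
matrix     26.64  13.51   8.96   6.76   5.40   4.50   3.89   3.40

allpairs (six of eight entries survive, in order): 12.60 10.25 9.53 9.57 9.69 9.71
abisort (seven of eight entries survive, in order): 20.53 12.06 9.44 8.44 8.17 8.35 8.69

Entries marked ? are illegible in the source; for allpairs and abisort the column
positions of the surviving values are uncertain.

Figure 10: Thread Benchmarks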
•  abisort: An adaptive bitonic sorting algorithm [12] of 2^? integers [34].

•  simple: The Simple hydrodynamics benchmark [15], which solves a set of differential
equations across a grid of size 100 x 100, run for one time step.

•  matrix: Matrix multiply of two 250 x 250 integer matrices.

Except  for  matrix,  these  applications  show  very  poor  relative  speedup,  being  unable  to 
make  productive  use  of  more  than  3  processors.  We  conjecture  that  this  poor  result  is 
due  almost  entirely  to  main-memory  bus  contention.  Independent  measurements  indicate 
that  the  bus  has  a  maximum  achievable  bandwidth  of  about  30  MB/sec;  most  of  these 
benchmarks  (with  the  notable  exception  of  matrix)  generate  15-20  MB/sec  of  bus  traffic 
in  allocation  alone!  We  suspect  that  a  large  performance  improvement  can  be  made  only 
by  moving  to  a  different  memory  allocation  strategy  (see  Section  8). 
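
The speedup and bus figures above can be checked with a few lines of arithmetic. The
running times are copied from Figure 10; the saturation estimate assumes that the 15-20
MB/sec allocation traffic is generated per processor, which is a reading of the text rather
than something it states.

```python
# Running times in seconds, copied from Figure 10 (1 and 8 processors).
times = {
    "simple": {1: 58.34, 8: 34.24},
    "matrix": {1: 26.64, 8: 3.40},
}

def speedup(prog, procs):
    """Relative speedup: one-processor time divided by n-processor time."""
    return times[prog][1] / times[prog][procs]

def procs_to_saturate(bus_mb_per_sec, per_proc_mb_per_sec):
    """How many procs' allocation traffic alone would fill the bus."""
    return bus_mb_per_sec / per_proc_mb_per_sec

print(round(speedup("matrix", 8), 2))
print(round(speedup("simple", 8), 2))
print(procs_to_saturate(30, 20), procs_to_saturate(30, 15))
```

On eight processors matrix achieves a speedup of roughly 7.8, while simple reaches only
about 1.7. Under the per-processor assumption, allocation traffic from 1.5 to 2 procs
would already fill a 30 MB/sec bus.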


7  Related  Work 

There  has  been  a  lot  of  interest  in  adding  threads  to  ML,  but  we  are  only  aware  of  two 
projects  that  have  addressed  running  the  threads  in  parallel  on  multiprocessors.  Cooper  and 
Morrisett  [14]  describe  a  Mach-based  multiprocessor  thread  implementation  for  SML/NJ, 
which  was  the  basis  for  our  MP  work.  The  SML/Mach  interface  provides  fork,  mutex 
locks, condition variables, and per-thread ref cells (vars). At the time SML/Mach was
developed, the VAX-6300 was the only multiprocessor running Mach for which an SML/NJ
code generator existed. To increase the availability of the ML Threads package, we decided
to rethink the interface to achieve portability across operating systems and machines.
Another shortcoming of the SML/Mach interface is that it does not provide a mechanism for
preemptively  scheduling  user-level  continuation  threads.  Also,  the  interface  is  designed  to 
support  only  one  thread  package  (ML  Threads)  directly  and  we  wanted  to  support  other 
thread  packages,  notably  CML. 

Matthews [31] describes several implementations of a concurrent extension of Poly/ML
using CSP-style communication operators [11]. The concurrency primitives are implemented
as hardwired extensions to the Poly/ML runtime system. Matthews describes a
uniprocessor implementation using timer preemption, an implementation for the DEC Firefly
multiprocessor, and the outline for a distributed workstation implementation. The
Firefly implementation is comparable to our MP approach; in particular, garbage collection
is arranged in very similar fashion. However, support for multiple threads and for
communication between them is buried deeper in the runtime system and is not designed for
portability among different thread models or hardware platforms.

The  Lisp  community  has  a  great  deal  of  experience  with  multiprocessor  symbolic  computing. 
Multilisp  [21]  and  Qlisp  [18]  are  two  languages  that  provide  concurrency  extensions  to  Lisp 
and  have  been  implemented  for  various  multiprocessors  [20,  28,  32,  43].  Multilisp  provides 
futures  as  its  primary  concurrency  mechanism.  Evaluation  of  the  expression  (future  X) 
forks  a  thread  to  evaluate  X.  A  capability  for  X’s  value  is  immediately  returned  to  the  caller. 
When  the  value  of  X  is  needed,  the  capability  can  be  touched,  at  which  point  the  caller 
is  delayed  until  the  evaluation  of  X  is  complete.  Qlisp  also  provides  futures  along  with  a 
variety  of  mechanisms  including  Qlet  and  mutex  locks.  Qlet  evaluates  the  right-hand  sides 
of  its  bindings  in  parallel. 
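
For readers unfamiliar with futures, the behaviour described above resembles the Future
objects in Python's standard concurrent.futures module. The sketch below is an analogy, not
Multilisp: slow_square is a hypothetical stand-in for the expression X.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(n):
    """Stand-in for an expensive expression X."""
    return n * n

with ThreadPoolExecutor(max_workers=2) as pool:
    fut = pool.submit(slow_square, 12)   # like (future X): returns at once
    other = 1 + 2                        # the caller keeps working
    value = fut.result()                 # "touch": wait until X is done
```

Here submit plays the role of future, and result plays the role of touching the
capability.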
These versions of Lisp only support one thread model (e.g., futures) and are not extensible.
A  great  deal  of  work  has  gone  into  determining  how  to  schedule  the  threads  and  how  to 
bound  the  number  of  threads  in  the  system.  For  instance,  some  implementations  allow  the 
user  to  select  from  a  fixed  set  of  scheduling  policies  (FIFO  or  LIFO).  Other  implementations 
turn  parallel  code  into  sequential  code  when  the  compiler  can  determine  that  doing  so  will 
improve  performance.  Our  approach  provides  more  flexibility;  MP  clients  can  tailor  a 
thread  package’s  scheduling  policy  and  mechanisms  and  its  synchronization  constructs  to 
meet  the  needs  of  particular  applications. 

Some  existing  concurrent  systems  do  stress  flexibility  and  portability  at  the  runtime-system 
level.  The  primary  goals  of  the  Portable  Common  Runtime  system  (PCR)  are  to  provide 
portability  and  language  inter-operability  [47].  PCR  offers  storage  management,  symbol 
binding (i.e., dynamic loading), threads, and low-level I/O to a language implementor. The
implementation  of  threads  under  PCR  is  similar  to  ours — user-threads  are  multiplexed  on 
top  of  kernel  threads.  However,  PCR  does  not  allow  the  thread  package  or  its  scheduling 
policy  to  be  customized.  Since  PCR  must  work  with  a  variety  of  languages,  it  uses  a 
conservative,  non-copying  garbage  collector. 

The  goals  of  Jagannathan  and  Philbin’s  STING  system  [26,  27]  closely  resemble  ours. 
STING  is  a  dialect  of  Scheme  enhanced  with  primitive  concurrency  operators  to  form  a 
substrate  for  custom  design  of  concurrent  symbolic  computing  environments.  The  basic 
data  types  are  threads  and  virtual  processors;  the  system  provides  flexibility  in  scheduling, 
storage allocation and thread migration policies without compromising efficiency. A major
difference from our work is that we rely on SML/NJ’s first-class continuations to represent
threads; STING’s thread data type is less elegant but more carefully crafted for
performance. Also, STING’s design does not emphasize portability among different hardware
and operating system platforms.


8  Discussion  and  Future  Work 

There  is  much  work  to  be  done  regarding  the  MP  interface.  Most  importantly,  to  determine 
the  adequacy  of  the  interface,  we  need  to  build  additional  complete  thread  packages,  such 
as  CML,  on  top  of  it.  In  the  following  paragraphs,  we  briefly  discuss  other  issues  and  open 
problems. 

Smarter memory management: Bus contention resulting from poor locality of
reference is a major problem with the current implementations. We suspect that the major
cause of this problem is SML/NJ’s allocation strategy. Recall that in SML/NJ all procedure
frames are allocated on the heap. This has the advantage that callcc and throw become
relatively cheap. However, it has the disadvantage that space is not re-used for allocation
until a garbage collection occurs. If the combined size of the from and to spaces of the
copying collector exceeds the size of a processor’s cache, this strategy ensures a cache
miss on almost every allocation.

There  are  two  basic  approaches  to  alleviating  this  problem.  One  approach  is  to  try  to  size 
the  from  and  to  spaces  (or,  more  generally,  the  allocation  space  and  youngest  generation 
of  a  multi-generation  collector)  for  each  processor  so  that  they  fit  within  the  processor’s 
cache. This approach has the advantage of leaving the fundamental SML/NJ heap allocation
strategy unchanged. However, it is unattractive on machines with small caches, such as the
Omron, which has a 16KB cache. The other approach is to use a stack to allocate procedure
frames that do not “escape.” Stack allocation of frames would allow memory to be re-used,
thus improving locality. To keep callcc and throw relatively cheap, we could
use techniques described by Hieb, et al. [24].
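
The locality argument can be illustrated with a toy direct-mapped cache model in Python.
The sketch makes simplifying assumptions: every access to a non-resident line counts as a
miss, each frame is exactly one cache line, and the class and function names are
hypothetical. Real caches, including the write-through caches of the SGI, behave in more
complicated ways.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks only which line is in each slot."""
    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        slot = line % self.num_lines
        if self.tags[slot] != line:      # miss: bring the line in
            self.tags[slot] = line
            self.misses += 1

def heap_frames(cache, calls, frame):
    """Every call allocates a fresh frame at the next heap address."""
    for i in range(calls):
        cache.access(i * frame)

def stack_frames(cache, calls):
    """Every call pushes and pops a frame at the same stack address."""
    for _ in range(calls):
        cache.access(0)
```

In this model, 1000 heap-allocated frames touch 1000 distinct lines and so miss 1000 times,
whereas 1000 stack frames that reuse the same line miss only once.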


Concurrent GC: As mentioned in Section 5.5, concurrent garbage collection is an
important goal. There have been many proposals for concurrent and/or parallel garbage
collection [8, 10, 23, 29, 36]. Parallel GC on a shared-bus machine might be unattractive
since GC is a memory intensive operation. We expect that a bus would rapidly become
saturated as more processors “helped” with the collection. Collecting garbage while the
computation proceeds seems more appealing, but most of the algorithms proposed have
drawbacks. Some require non-standard hardware support [23], while others require more
efficient traps and more flexible virtual memory management than today’s operating
systems provide [8]. As well, most of the algorithms (with the notable exception of the one
described by Herlihy and Moss [23]) rely on some global synchronization of the processes.
If, as in the SML/NJ system, processes cannot be stopped at arbitrary points to initiate a
GC, the cost of the synchronization could overwhelm the benefits of a concurrent GC.


Smarter  memory  model:  At  present,  the  SML/NJ  compiler  takes  a  very  conservative 
approach  to  optimizing  accesses  to  references  and  arrays.  It  performs  all  operations  specified 
in  the  source  code  even  when  simple  dataflow  analysis  would  show  some  of  them  to  be 
superfluous.  This  conservatism  has  been  helpful  in  developing  a  multiprocessor  interface, 
since  the  compiler  makes  no  assumptions  that  might  be  violated  in  a  concurrent  setting. 
It  might  be  worthwhile,  however,  to  make  the  compiler  smarter  about  performing  such 
optimizations,  which  could  be  particularly  valuable  for  decreasing  the  cost  of  runtime  safety 
checks  on  shared  accesses. 

Such  optimizations  require  making  a  more  explicit  connection  between  shared  memory 
locations  and  the  mutual  exclusion  locks  that  govern  them,  and  exposing  that  connection 
to  the  compiler.  As  more  complex  cache  coherence  protocols  come  into  common  use,  the 
compiler  will  be  required  to  take  on  this  knowledge  in  any  case,  in  order  to  produce  correct 
code  even  in  a  conservative  fashion. 

Smarter I/O: Our current I/O implementation is portable, but not wholly satisfactory.
When an MP client thread blocks, we would ideally like the current proc (and physical
processor) to be made available to run other threads; under most operating systems,
however, MP loses control of the processor until the I/O completes or is aborted by a timer
signal.

One option would be to use extra kernel threads to do the blocking I/O operations.
Unfortunately, “handing off” the I/O operation is problematic. The extra kernel threads can
either busy-wait for an I/O operation or block. If they busy-wait, they waste cycles when
I/O operations are infrequent. If they block, then an extra kernel call is required to wake
them.

Scheduler activations [2] are an elegant solution to this problem. In the scheduler activation
model, each task is given a virtual multiprocessor. However, the number of processors
assigned to a task at any point in time is controlled by the kernel. There is a one-to-one
mapping of a task’s non-blocked kernel threads (scheduler activations) to its processors.
When a scheduler activation blocks for I/O, the kernel constructs a new activation to run
on the processor. The new activation makes an upcall to the application informing it that
the original activation blocked. The new activation is free to continue executing other
user-level threads. When the I/O completes, the kernel makes another upcall to inform the
application.
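
The upcall protocol described above can be sketched as a small state model. This is only an illustration under stated assumptions: the names upcall, sched, take and on_upcall are hypothetical, user-level threads are represented by strings, activations by integers, and no real kernel interface is involved.

```sml
(* Toy model of the two upcalls described in the text (names are assumptions). *)
datatype upcall
  = Blocked of {old : int, new : int}  (* activation `old` blocked; `new` replaces it *)
  | Unblocked of int                   (* the I/O of this activation has completed *)

type sched = {ready   : string list,          (* runnable user-level threads *)
              running : (int * string) list,  (* activation id paired with its thread *)
              parked  : (int * string) list}  (* blocked activation id and its thread *)

(* Remove the entry for activation a, returning its thread if there is one. *)
fun take (a : int) (xs : (int * string) list) =
  (Option.map (fn (_, t) => t) (List.find (fn (b, _) => b = a) xs),
   List.filter (fn (b, _) => b <> a) xs)

fun on_upcall (Blocked {old, new}) ({ready, running, parked} : sched) : sched =
      let
        val (t, running') = take old running
        (* The thread that was running on the blocked activation is parked. *)
        val parked' = case t of SOME th => (old, th) :: parked | NONE => parked
      in
        (* The new activation continues with another ready thread, if any. *)
        case ready of
          next :: rest => {ready = rest, running = (new, next) :: running',
                           parked = parked'}
        | [] => {ready = [], running = running', parked = parked'}
      end
  | on_upcall (Unblocked a) {ready, running, parked} =
      let val (t, parked') = take a parked
      in
        (* The thread whose I/O completed becomes runnable again. *)
        {ready = (case t of SOME th => ready @ [th] | NONE => ready),
         running = running, parked = parked'}
      end
```

In this model the processor is never left idle while a ready thread exists, which is the property that the extra-kernel-thread approach struggles to provide cheaply.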

We  are  currently  investigating  different  approaches  to  integrating  user-level  threads  and  I/O 
under  the  Mach  3.0  operating  system.  We  are  also  investigating  a  scheduler  activation-based 
multiprocessor  interface  for  SML  [35].


Semantics:  A  formal  specification  of  the  SML/MP  system  would  be  valuable.  Without  a 
specification,  one  of  the  most  important  advantages  of  using  SML  as  a  base  language  is  lost. 
Recently,  some  progress  has  been  made  in  the  exploration  of  a  semantics  for  a  multiprocess 
SML  [11,  40].  We  hope  to  be  able  to  use  these  results  to  give  a  somewhat  more  formal 
semantics  for  the  MP  interface. 


A  Implementing  Mutex  Locks  in  SML

In  the  ML  Threads  paradigm  [14],  a  mutex  lock  is  informally  associated  with  each  piece 
of  shared,  mutable  data  (e.g.,  a  ref  cell).  When  a  thread  wants  to  perform  an  operation
on  some  set  of  data,  it  should  first  acquire  the  associated  mutex  locks.  After  it  is  through 
with  the  operation,  it  releases  the  mutexes.  We  say  that  a  thread  “holds”  a  mutex  if  it  has 
acquired  but  not  released  the  mutex.  Only  one  thread  is  allowed  to  hold  a  given  mutex  at 
any  point  in  time. 
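
The acquire/operate/release discipline can be illustrated with a minimal sketch. The structure ToyMutex and the helper with_mutex are hypothetical names introduced here; ToyMutex only tracks ownership and never blocks, so it is a single-threaded stand-in for the real mutex implementation, not a substitute for it.

```sml
(* The MUTEX signature, repeated so that the sketch is self-contained. *)
signature MUTEX =
sig
  type mutex
  val mutex   : unit -> mutex
  val acquire : mutex -> unit
  val release : mutex -> unit
end

(* Assumption: a toy mutex that records ownership but cannot block. *)
structure ToyMutex : MUTEX =
struct
  type mutex = bool ref
  fun mutex () = ref false
  fun acquire m = if !m then raise Fail "would block" else m := true
  fun release m = m := false
end

(* Hypothetical helper: run f while holding m, releasing m on normal
   return and also when f raises an exception. *)
fun with_mutex m f =
  (ToyMutex.acquire m;
   (f () handle e => (ToyMutex.release m; raise e))
   before ToyMutex.release m)

(* A ref cell and the mutex informally associated with it. *)
val balance = ref 100
val balance_lock = ToyMutex.mutex ()

fun deposit n = with_mutex balance_lock (fn () => balance := !balance + n)

val () = deposit 25
```

Wrapping the operation in a helper is one way to make sure a mutex is not left held when the operation raises an exception; the association between the lock and the data remains informal.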

Mutex  locks  are  typically  implemented  by  blocking  a  thread  that  attempts  to  acquire  a 
mutex  that  is  already  held.  On  a  uniprocessor,  this  approach  is  essential  since  some  other 
thread  must  run  in  order  for  the  mutex  to  become  available.  On  a  multiprocessor,  it  makes 
better  use  of  processor  resources  than  busy-waiting. 

signature MUTEX =
sig
  type mutex
  val mutex   : unit -> mutex
  val acquire : mutex -> unit
  val release : mutex -> unit
end

signature PIPE =
sig
  type '1a pipe
  val pipe : unit -> '1a pipe
  val put  : '1a pipe -> '1a -> unit
  val get  : '1a pipe -> '1a
end

signature THREAD_INTERNALS =
sig
  structure SafeQ : QUEUE
  val ready : (unit cont) SafeQ.queue
  val enter_atomic : unit -> unit
  val leave_atomic : unit -> unit
end


Figure  11:  Signatures  for  Mutex  Locks  and  Pipes 

A  signature  for  Mutex  locks  is  shown  in  Figure  11.  Mutex  locks  can  be  added  to  the 
MP  thread  package  of  Section  4.6  by  using  the  functor  Par_Mutex  shown  in  Figure  12. 
The  functor  has  access  to  “internal”  components  of  the  MP_Thread  functor  of  Figure  7  via
parameter  T;  the  THREAD_INTERNALS  signature  is  shown  in  Figure  11.  Each  mutex  consists
of  a  flag  (held)  that  indicates  whether  the  mutex  is  owned  by  some  thread,  and  a  queue 
(waiters)  of  waiting  threads.  The  acquire  operation  examines  the  held  flag.  If  the  mutex 
is  not  held,  then  the  flag  is  simply  set  to  true.  If  the  mutex  is  held,  the  current  thread’s 
continuation  is  enqueued  on  waiters  and  a  new  continuation  is  dequeued  from  ready  and 
invoked. If ready is empty, exception Deadlock is raised with the system in a dirty state:
the spin lock is still held and the atomic region has not been left.
The  release  operation  attempts  to  hand  off  ownership  of  the  mutex  to  some  other  thread. 
It  does  so  by  dequeuing  a  waiting  thread  and  rescheduling  it.  If  no  threads  are  waiting  for 
the  mutex,  the  held  flag  is  cleared. 
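
The hand-off behaviour of release is easy to miss, so the following sketch models only the state transitions of the held flag and the two queues. It is an illustration under assumptions, not the implementation: threads are represented by integer ids instead of continuations, the queues are plain lists, and the step in which a blocked thread switches to another ready thread is omitted.

```sml
(* Abstract state of one mutex plus the ready queue (thread ids only). *)
type model = {held : bool, waiters : int list, ready : int list}

(* acquire by thread t: returns the new state and whether t proceeds
   immediately (true) or is enqueued on waiters (false). *)
fun acquire (t : int) ({held, waiters, ready} : model) : model * bool =
  if held
  then ({held = true, waiters = waiters @ [t], ready = ready}, false)
  else ({held = true, waiters = waiters, ready = ready}, true)

(* release: if a thread is waiting, ownership passes directly to it and
   held stays true; only when nobody is waiting is the flag cleared. *)
fun release ({held = _, waiters, ready} : model) : model =
  case waiters of
    []      => {held = false, waiters = [], ready = ready}
  | w :: ws => {held = true, waiters = ws, ready = ready @ [w]}
```

Because held is never cleared during a hand-off, a third thread that calls acquire between the release and the rescheduling of the waiter cannot take the mutex ahead of it.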

To prevent a mutex data structure from entering an inconsistent state, preemption is
disabled during acquire and release operations by wrapping them with enter_atomic and
leave_atomic calls (see Figure 3). To make mutex locks work in a multiprocessor environment,
the mutex data structure must also be protected against simultaneous update by two
different processors, using an MP spin_lock (see Section 4.6). A single global spin_lock
could be used to protect all mutexes, but it is more efficient to introduce per-mutex locks, as
shown in Figure 12. We simply add a call to lock the spin_lock before doing an operation
and a call to unlock it after the operation is complete.

functor Par_Mutex (structure MP : MP
                   structure UnsafeQ : QUEUE
                   structure T : THREAD_INTERNALS) : MUTEX =
struct
  exception Deadlock

  datatype mutex = MUTEX of {lock    : MP.spin_lock,
                             held    : bool ref,
                             waiters : unit cont UnsafeQ.queue}

  fun mutex () = MUTEX {lock    = MP.spin_lock (),
                        held    = ref false,
                        waiters = UnsafeQ.create ()}

  fun acquire (MUTEX{lock, held, waiters}) =
    (T.enter_atomic ();
     MP.lock lock;
     if (!held) then
       callcc (fn k => (UnsafeQ.enq waiters k;
                        let val k' = T.SafeQ.deq T.ready
                                     handle T.SafeQ.Deq => raise Deadlock
                        in
                          MP.unlock lock;
                          T.leave_atomic ();
                          throw k' ()
                        end))
     else
       (held := true;
        MP.unlock lock;
        T.leave_atomic ()))

  fun release (MUTEX{lock, held, waiters}) =
    (T.enter_atomic ();
     MP.lock lock;
     (T.SafeQ.enq T.ready (UnsafeQ.deq waiters))
       handle UnsafeQ.Deq => (held := false);
     MP.unlock lock;
     T.leave_atomic ())
end


Figure  12:  MP  Mutex  Locks 

Figure  14  shows  how  the  system  can  be  linked  to  provide  a  Mutex  structure.


B  Implementing  Pipes  in  SML 

An  alternative  to  the  shared  memory  paradigm  for  communication  and  synchronization  is 
message  passing.  In  this  paradigm,  threads  are  allowed  to  share  only  non-mutable  data 
and  abstract  communication  channels.  The  operations  provided  for  channels  and  their 
synchronization  semantics  can  vary  a  great  deal  [3];  at  a  minimum,  operations  for  sending 
values  to  channels  and  receiving  values  from  channels  are  needed.  The  CML  language  [39] 
extends  SML  by  providing  threads,  channels,  and  powerful  communication  operations  on 
channels. 

Pipes  are  a  special  case  of  channels  characterized  by  their  asymmetric  synchronization.  A 
pipe  can  be  thought  of  (and  is  usually  implemented)  as  a  queue.  Senders  put  values  into 
the  pipe  and  never  block.  Receivers  get  values  from  the  pipe,  blocking  if  necessary  until  a 
value  is  available. 
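
The asymmetric synchronization can be made concrete with a small state model. This is a sketch under assumptions: pipe_model, outcome, put and get are names chosen for illustration, waiting receivers are represented by integer thread ids, and queues are plain lists.

```sml
(* Abstract state of a pipe: buffered values and waiting receivers. *)
type 'a pipe_model = {values : 'a list, waiters : int list}

datatype 'a outcome = Got of 'a | Blocked

(* put never blocks: it either hands x to the first waiting receiver
   (reported as SOME (receiver, x)) or buffers x. *)
fun put (x : 'a) ({values, waiters} : 'a pipe_model) =
  case waiters of
    w :: ws => ({values = values, waiters = ws}, SOME (w, x))
  | []      => ({values = values @ [x], waiters = []}, NONE)

(* get by thread t: returns a buffered value if one exists, otherwise
   records t as a waiter and reports that it is blocked. *)
fun get (t : int) ({values, waiters} : 'a pipe_model) =
  case values of
    v :: vs => ({values = vs, waiters = waiters}, Got v)
  | []      => ({values = [], waiters = waiters @ [t]}, Blocked)
```

In this model at most one of the two queues is non-empty at any time: values accumulate only when no receiver is waiting, and receivers wait only when no value is buffered.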

A signature for pipes is shown in Figure 11. Pipes can be added to the MP thread package
described in Section 4.6 using the functor Par_Pipe shown in Figure 13. A '1a pipe value
consists of a queue of '1a values (values) and a queue of '1a-accepting continuations
(waiters). The get operation attempts to dequeue a value and return it. If no value is
available, the thread is blocked on the waiters queue as a '1a-accepting continuation. The
put operation attempts to “hand” the given value to a waiting thread and reschedule the
thread. The “handing” of the value is accomplished by the make_unit function. make_unit
takes a '1a-accepting continuation, k, and a '1a value, x, and returns a unit-accepting
continuation which, when invoked, throws x to k. If put discovers that there are no waiters,
it enqueues the given value onto the values queue.
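
The behaviour of make_unit can be checked in isolation. The sketch below assumes a current SML/NJ, where callcc and throw are found in the SMLofNJ.Cont structure; the names log, note and result are introduced here for illustration only.

```sml
open SMLofNJ.Cont   (* callcc, throw, and the 'a cont type *)

(* Package continuation k and value x as a unit-accepting continuation. *)
fun make_unit (k : 'a cont) (x : 'a) : unit cont =
  callcc (fn c1 => (callcc (fn c2 => throw c1 c2);
                    throw k x))

val log : string list ref = ref []
fun note s = log := !log @ [s]

val result : int =
  callcc (fn k =>
    let
      val u = make_unit k 42
      val () = note "built"   (* reached: building u delivers nothing to k *)
    in
      throw u ();             (* only now is 42 thrown to k *)
      0                       (* never reached *)
    end)

val () = note "delivered"
```

The inner callcc captures the point just before throw k x and returns that continuation to the caller through c1, which is why the value is delivered only when the resulting unit continuation is later invoked, for instance by the scheduler after it is dequeued from ready.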

The pipe’s data structures are protected by wrapping pipe operations with enter_atomic
and leave_atomic, and adding a spin_lock to the pipe datatype with appropriate calls to
lock and unlock it.

Figure  14  shows  how  the  system  can  be  linked  to  provide  a  Pipe  structure. 

functor Par_Pipe (structure MP : MP
                  structure UnsafeQ : QUEUE
                  structure T : THREAD_INTERNALS) : PIPE =
struct
  exception Deadlock

  datatype '1a pipe = PIPE of {values  : '1a UnsafeQ.queue,
                               waiters : ('1a cont) UnsafeQ.queue,
                               lock    : MP.spin_lock}

  fun pipe () = PIPE {values  = UnsafeQ.create (),
                      waiters = UnsafeQ.create (),
                      lock    = MP.spin_lock ()}

  fun make_unit (k : 'a cont) (x : 'a) : unit cont =
    callcc (fn c1 => (callcc (fn c2 => (throw c1 c2));
                      throw k x))

  fun put (PIPE{values, waiters, lock}) x =
    (T.enter_atomic ();
     MP.lock lock;
     (T.SafeQ.enq T.ready (make_unit (UnsafeQ.deq waiters) x))
       handle UnsafeQ.Deq => (UnsafeQ.enq values x);
     MP.unlock lock;
     T.leave_atomic ())

  fun get (PIPE{values, waiters, lock}) =
    (T.enter_atomic ();
     MP.lock lock;
     (let val x = UnsafeQ.deq values
      in
        MP.unlock lock;
        T.leave_atomic ();
        x
      end) handle UnsafeQ.Deq =>
        (callcc (fn k => (UnsafeQ.enq waiters k;
                          let val c = T.SafeQ.deq T.ready
                                      handle T.SafeQ.Deq => raise Deadlock
                          in
                            MP.unlock lock;
                            T.leave_atomic ();
                            throw c ()
                          end))))
end


Figure  13:  MP  Pipes 




structure ThreadO = MP_Thread (structure MP = MP
                               structure SafeQ = SafeQ)
structure Thread : THREAD = ThreadO
structure ThreadInternals : THREAD_INTERNALS = ThreadO

structure Mutex = Par_Mutex (structure MP = MP
                             structure UnsafeQ = Queue
                             structure T = ThreadInternals)

structure Pipe = Par_Pipe (structure MP = MP
                           structure UnsafeQ = Queue
                           structure T = ThreadInternals)


Figure  14:  Linking  Mutexes  and  Pipes 